spaCy is a natural language processing library for Python that includes a basic model capable of recognising (ish!) names of people, places and organisations, as well as dates and financial amounts.
According to the spaCy entity recognition documentation, the built-in model recognises the following types of entity:
- PERSON - People, including fictional.
- NORP - Nationalities or religious or political groups.
- FACILITY - Buildings, airports, highways, bridges, etc.
- ORG - Companies, agencies, institutions, etc.
- GPE - Countries, cities, states. (That is, geo-political entities.)
- LOC - Non-GPE locations, mountain ranges, bodies of water.
- PRODUCT - Objects, vehicles, foods, etc. (Not services.)
- EVENT - Named hurricanes, battles, wars, sports events, etc.
- WORK_OF_ART - Titles of books, songs, etc.
- LANGUAGE - Any named language.
- LAW - A legislation related entity(?)

Quantities are also recognised:

- DATE - Absolute or relative dates or periods.
- TIME - Times smaller than a day.
- PERCENT - Percentage, including "%".
- MONEY - Monetary values, including unit.
- QUANTITY - Measurements, as of weight or distance.
- ORDINAL - "first", "second", etc.
- CARDINAL - Numerals that do not fall under another type.

Custom models can also be trained, but this requires annotated training documents.
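As a rough sketch of what that annotated data looks like, spaCy's training examples pair each text with the character offsets and label of every entity it contains; the sentence and spans below are made up purely for illustration:
In [ ]:
#Hypothetical example of the offset-annotated training data used to train a custom spaCy NER model
#Each item pairs a text with the (start, end, label) character spans of the entities it contains
TRAIN_DATA = [
    ("Boris Johnson visited the Nestlé factory in York",
     {'entities': [(0, 13, 'PERSON'), (26, 32, 'ORG'), (44, 48, 'GPE')]}),
]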
In [ ]:
#!pip3 install spacy
In [2]:
#Load the built-in English model (spaCy 1.x API; in spaCy 2+ use spacy.load('en_core_web_sm'))
from spacy.en import English
parser = English()
In [3]:
example='''
That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in
York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland;
acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase
over the same period in 2016; further notes 156 of these job losses will be in York, a city that in
the last six months has seen 2,000 job losses announced and has become the most inequitable city outside
of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of
triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with
the EU are commencing; further recognises the cost of importing products, including sugar, cocoa and
production machinery, has risen due to the weakness of the pound and the uncertainty over the UK's future
relationship with the single market and customs union; and calls on the Government to intervene and work
with hon. Members, trades unions GMB and Unite and the company to avert these job losses now and prevent
further job losses across Nestlé.
'''
In [4]:
#Code "borrowed" from somewhere?!
def entities(example, show=False):
    if show: print(example)
    parsedEx = parser(example)
    print("-------------- entities only ---------------")
    # if you just want the entities and nothing else, you can access the parsed example's "ents" property like this:
    ents = list(parsedEx.ents)
    tags={}
    for entity in ents:
        #print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))
        term=' '.join(t.orth_ for t in entity)
        if term not in tags:
            tags[term]=[(entity.label, entity.label_)]
        else:
            tags[term].append((entity.label, entity.label_))
    print(tags)
In [5]:
entities(example)
In [6]:
q= "Bob Smith was in the Houses of Parliament the other day"
entities(q)
Note that the models are typically trained in a way that relies on cues from the correct capitalisation of named entities.
In [8]:
entities(q.lower())
In [3]:
#!pip3 install polyglot
##Mac ??
#!brew install icu4c
#I found I needed: pip3 install pyicu, pycld2, morfessor
##Linux
#apt-get install libicu-dev
In [5]:
!polyglot download embeddings2.en ner2.en
In [109]:
from polyglot.text import Text
text = Text(example)
text.entities
Out[109]:
In [11]:
Text(q).entities
Out[11]:
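As with the spaCy tagger, we can check how well polyglot copes when the capitalisation cues are removed (a quick check; the result will depend on the downloaded polyglot models):
In [ ]:
#Try the polyglot entity tagger on a lowercased version of the same sentence
Text(q.lower()).entities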
Sometimes we may have a list of entities that we wish to match in a text. For example, suppose we have a list of MPs' names, or a list of organisations or subject terms identified in a thesaurus, and we want to tag a set of documents with those entities if the entity exists in the document.
To do this, we can search a text for strings that exactly match any of the specified terms, or where any of the specified terms match part of a longer string in the text.
Naive implementations can take a significant time to find multiple strings within a text, but the Aho-Corasick algorithm will efficiently match a large set of key terms within a particular text.
In [ ]:
## The following recipe was hinted at via @pudo
#!pip3 install pyahocorasick
#https://github.com/alephdata/aleph/blob/master/aleph/analyze/corasick_entity.py
First, construct an automaton that identifies the terms you want to detect in the target text.
In [22]:
from ahocorasick import Automaton
A=Automaton()
A.add_word("Europe",('VOCAB','Europe'))
A.add_word("European Union",('VOCAB','European Union'))
A.add_word("Boris Johnson",('PERSON','Boris Johnson'))
A.add_word("Boris",('PERSON','Boris Johnson'))
A.add_word("boris johnson",('PERSON','Boris Johnson (LC)'))
A.make_automaton()
In [30]:
q2='Boris Johnson went off to Europe to complain about the European Union'
for item in A.iter(q2):
    print(item, q2[:item[0]+1])
Once again, case is important.
In [33]:
q2l = q2.lower()
for item in A.iter(q2l):
    print(item, q2l[:item[0]+1])
We can tweak the automaton patterns to capture the length of the matched term, so we can annotate the text with matches more exactly:
In [35]:
A=Automaton()
A.add_word("Europe",(('VOCAB', len("Europe")),'Europe'))
A.add_word("European Union",(('VOCAB', len("European Union")),'European Union'))
A.add_word("Boris Johnson",(('PERSON', len("Boris Johnson")),'Boris Johnson'))
A.add_word("Boris",(('PERSON', len("Boris")),'Boris Johnson'))
A.make_automaton()
In [55]:
for item in A.iter(q2):
    start=item[0]-item[1][0][1]+1
    end=item[0]+1
    print(item, '{}*{}*{}'.format(q2[start-3:start],q2[start:end],q2[end:end+3]))
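Putting those pieces together, here is a minimal sketch of building an automaton from an arbitrary list of vocabulary terms and using it to tag the example text from earlier; the term list here is made up for illustration:
In [ ]:
#Hypothetical vocabulary list - in practice this might be a list of MPs' names or thesaurus terms
vocab = ['Nestlé', 'York', 'Fawdon', 'Girvan', 'Article 50']

#Build the automaton from the vocabulary list, storing a label, the term length and the term itself as the payload
B = Automaton()
for term in vocab:
    B.add_word(term, ('VOCAB', len(term), term))
B.make_automaton()

#Report each match along with its start and end character positions in the text
for end_index, (label, length, term) in B.iter(example):
    start = end_index - length + 1
    print(label, term, start, end_index)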
Imagine a situation where we have managed to extract arbitrary named entities from a text, but they do not match strings in a specified list in an exact or partially exact way. Our next step might be to attempt to further match those entities in a fuzzy way with entities in a specified list.
fuzzyset
The python fuzzyset
package will try to match a specified string to similar strings in a list of target strings, returning a single item from a specified target list that best matches the provided term.
For example, if we extract the name Boris Johnstone in a text, we might then try to further match that string, in a fuzzy way, with a list of correctly spelled MP names.
A confidence value expresses the degree of match to terms in the fuzzy match set list.
In [80]:
import fuzzyset
fz = fuzzyset.FuzzySet()
#Create a list of terms we would like to match against in a fuzzy way
for l in ["Diane Abbott", "Boris Johnson"]:
fz.add(l)
#Now see if our sample term fuzzy matches any of those specified terms
sample_term='Boris Johnstone'
fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')
Out[80]:
fuzzywuzzy
If we want to try to find a fuzzy match for a term within a text, we can use the python fuzzywuzzy
library. Once again, we specify a list of target items we want to try to match against.
In [19]:
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
In [20]:
terms=['Houses of Parliament', 'Diane Abbott', 'Boris Johnson']
q= "Diane Abbott, Theresa May and Boris Johnstone were in the Houses of Parliament the other day"
process.extract(q,terms)
Out[20]:
By default, we get match confidence levels for each term in the target match set, although we can limit the response to a maximum number of matches:
In [21]:
process.extract(q,terms,scorer=fuzz.partial_ratio, limit=2)
Out[21]:
A range of fuzzy match scoring algorithms is supported:

- WRatio - measure of the sequences' similarity between 0 and 100, using different algorithms
- QRatio - quick ratio comparison between two strings
- UWRatio - a measure of the sequences' similarity between 0 and 100, using different algorithms; the same as WRatio but preserving unicode
- UQRatio - unicode quick ratio
- ratio
- token_sort_ratio - a measure of the sequences' similarity between 0 and 100, but sorting the tokens before comparing
- partial_token_set_ratio
- partial_token_sort_ratio - ratio of the most similar substring as a number between 0 and 100, but sorting the tokens before comparing

More useful, perhaps, is to return items that match above a particular confidence level:
In [22]:
process.extractBests(q,terms,score_cutoff=90)
Out[22]:
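To get a feel for how the individual scorers differ, we can apply a selection of them directly to a single pair of strings (a quick sketch; exact scores will vary with the fuzzywuzzy version installed):
In [ ]:
#Compare a few of the scorers on the same pair of strings
pair = ('Boris Johnstone', 'Boris Johnson')
for scorer in [fuzz.ratio, fuzz.partial_ratio, fuzz.token_sort_ratio, fuzz.WRatio]:
    print(scorer.__name__, scorer(*pair))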
However, one problem with the fuzzywuzzy
matcher is that it doesn't tell us where in the supplied text string the match occurred, or what string in the text was matched.
The fuzzywuzzy
package can also be used to try to deduplicate a list of items, returning the longest item in the duplicate list. (It might be more useful if this is optionally the first item in the original list?)
In [54]:
names=['Diane Abbott', 'Boris Johnson','Boris Johnstone','Diana Abbot', 'Boris Johnston','Joanna Lumley']
In [55]:
process.dedupe(names, threshold=80)
Out[55]:
It might also be useful to see the candidate strings associated with each deduped item, treating the first item in the list as the canonical one:
In [131]:
import hashlib
clusters={}
fuzzed=[]
for t in names:
    fuzzymatches=process.extractBests(t,names,score_cutoff=85)
    #Generate a key based on the sorted members of the set
    keyvals=sorted(set([x[0] for x in fuzzymatches]),key=lambda x:names.index(x),reverse=False)
    keytxt=''.join(keyvals)
    #hashlib expects bytes, so encode the key string before hashing it
    key=hashlib.md5(keytxt.encode('utf-8')).hexdigest()
    if len(keyvals)>1 and key not in fuzzed:
        clusters[key]=sorted(set([x for x in fuzzymatches]),key=lambda x:names.index(x[0]),reverse=False)
        fuzzed.append(key)

for cluster in clusters:
    print(clusters[cluster])
As well as running as a browser-accessed application, OpenRefine also runs as a service that can be accessed from Python using the refine-client.py client library.
In particular, we can use the OpenRefine service to cluster fuzzily matched items within a list of items.
In [4]:
#!pip install git+https://github.com/PaulMakepeace/refine-client-py.git
#NOTE - this requires a python 2 kernel
In [25]:
#Initialise the connection to the server using default or environment variable defined server settings
#REFINE_HOST = os.environ.get('OPENREFINE_HOST', os.environ.get('GOOGLE_REFINE_HOST', '127.0.0.1'))
#REFINE_PORT = os.environ.get('OPENREFINE_PORT', os.environ.get('GOOGLE_REFINE_PORT', '3333'))
from google.refine import refine, facet
server = refine.RefineServer()
orefine = refine.Refine(server)
In [133]:
#Create an example CSV file to load into a test OpenRefine project
project_file = 'simpledemo.csv'
with open(project_file,'w') as f:
    for t in ['Name']+names+['Boris Johnstone']:
        f.write(t+ '\n')
!cat {project_file}
In [134]:
p=orefine.new_project(project_file=project_file)
p.columns
Out[134]:
OpenRefine supports a range of clustering functions:
- clusterer_type: binning; function: fingerprint|metaphone3|cologne-phonetic
- clusterer_type: binning; function: ngram-fingerprint; params: {'ngram-size': INT}
- clusterer_type: knn; function: levenshtein|ppm; params: {'radius': FLOAT,'blocking-ngram-size': INT}
In [136]:
clusters=p.compute_clusters('Name',clusterer_type='binning',function='cologne-phonetic')
for cluster in clusters:
    print(cluster)
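We could also try one of the knn clusterers, passing its parameters in via a params dict (a hedged sketch that assumes the refine client passes the params through to OpenRefine unchanged; the parameter values here are just for illustration):
In [ ]:
#Nearest neighbour clustering using Levenshtein distance
knn_clusters=p.compute_clusters('Name',clusterer_type='knn',function='levenshtein',
                                params={'radius':2,'blocking-ngram-size':6})
for cluster in knn_clusters:
    print(cluster)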
In [ ]:
#!pip3 install gensim
In [53]:
#https://github.com/sgsinclair/alta/blob/e5bc94f7898b3bcaf872069f164bc6534769925b/ipynb/TopicModelling.ipynb
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from gensim import corpora, models
def get_lda_from_lists_of_words(lists_of_words, **kwargs):
    dictionary = corpora.Dictionary(lists_of_words) # this dictionary maps terms to integers
    corpus = [dictionary.doc2bow(text) for text in lists_of_words] # create a bag of words from each document
    tfidf = models.TfidfModel(corpus) # this models the significance of words using term frequency inverse document frequency
    corpus_tfidf = tfidf[corpus]
    kwargs["id2word"] = dictionary # set the dictionary
    return models.LdaModel(corpus_tfidf, **kwargs) # do the LDA topic modelling

def print_top_terms(lda, num_terms=10):
    txt=[]
    num_terms=min([num_terms,lda.num_topics])
    for i in range(0, num_terms):
        terms = [term for term,val in lda.show_topic(i,num_terms)]
        txt.append("\t - top {} terms for topic #{}: {}".format(num_terms,i,' '.join(terms)))
    return '\n'.join(txt)
To start with, let's create a list of dummy documents and then generate word lists for each document.
In [78]:
docs=['The banks still have a lot to answer for the financial crisis.',
      'This MP and that Member of Parliament were both active in the debate.',
      'The companies that work in finance need to be responsible.',
      'There is a responsibility incumbent on all participants for high quality debate in Parliament.',
      'Corporate finance is a big responsibility.']
#Create lists of words from the text in each document
from nltk.tokenize import word_tokenize
docs = [ word_tokenize(doc.lower()) for doc in docs ]
#Remove stop words from the wordlists
from nltk.corpus import stopwords
docs = [ [word for word in doc if word not in stopwords.words('english') ] for doc in docs ]
Now we can generate the topic models from the list of word lists.
In [82]:
topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
The model is randomised - if we run it again we are likely to get a different result.
In [83]:
topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
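If we need repeatable results, we can fix the random seed (a minimal sketch, assuming the installed gensim version supports LdaModel's random_state parameter, which our helper function passes through via its kwargs):
In [ ]:
#Fix the random seed so that repeated runs return the same topics
topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20, random_state=42)
print( print_top_terms(topicsLda))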
In [ ]: